
PDF stage 1.3 (part A): composite (Type0) fonts#533

Merged: andiwand merged 2 commits into main from pdf-composite-cid-fonts on Jun 14, 2026

@andiwand andiwand commented Jun 14, 2026


Stacked on #532 (stage 1.2). Targets pdf-encoding-to-unicode; retarget to main once #532 merges.

Stage 1.3, part A — composite (Type0/CID) fonts

The roadmap's stage 1.3 is composite fonts: Identity-H/V + predefined CJK CMaps, mapping code → CID → Unicode. Scanning the corpus reshaped the work: every Type0 font we have is /Identity-H and carries a /ToUnicode CMap, which the stage-1.1 multi-byte CMap path already handles. So this PR is the structural landing (part A); the heavy predefined-CJK-CMap data (part B) is deferred until there's a CJK fixture to validate it against.
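For readers unfamiliar with the code → CID → Unicode chain, here is a minimal sketch of the Identity-H case. The names `identity_h_cids`, `ToUnicodeMap` and `decode` are hypothetical and not taken from the codebase, and the /ToUnicode CMap is simplified to one code point per code (a real CMap can map a code to several UTF-16 units).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not the project's API.
// Identity-H: every code is two bytes, big-endian, and the CID equals the code.
std::vector<std::uint16_t> identity_h_cids(std::string_view bytes) {
  std::vector<std::uint16_t> cids;
  // A trailing odd byte cannot form a full code and is ignored in this sketch.
  for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
    const auto hi = static_cast<unsigned char>(bytes[i]);
    const auto lo = static_cast<unsigned char>(bytes[i + 1]);
    cids.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
  }
  return cids;
}

// Simplified stand-in for a parsed /ToUnicode CMap: code -> one code point.
using ToUnicodeMap = std::map<std::uint16_t, char32_t>;

// Decodes a shown string; yields no text if any code is missing from the map.
std::optional<std::u32string> decode(std::string_view bytes,
                                     const ToUnicodeMap &to_unicode) {
  std::u32string out;
  for (const std::uint16_t code : identity_h_cids(bytes)) {
    const auto it = to_unicode.find(code);
    if (it == to_unicode.end()) {
      return std::nullopt;
    }
    out.push_back(it->second);
  }
  return out;
}
```

With Identity-H the CID step is trivial, which is why a /ToUnicode CMap alone is enough to extract text from such fonts.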

Changes

  • parse_font detects /Subtype /Type0, walks /DescendantFonts[0], records the descendant CIDFont's /CIDSystemInfo /Registry and /Ordering on Font, and keeps the Type0 /Encoding (a code → CID CMap) out of the simple-font parse_encoding path, so Identity-H no longer trips the "unsupported /Encoding name" warning.
  • Font gains composite / cid_registry / cid_ordering.
  • Font::to_unicode: a composite font without a /ToUnicode now returns "no Unicode" instead of mis-splitting its multi-byte codes into byte-garbage through the single-byte identity fallback (sets up stage 1.5).
  • Tests: DocumentParser.composite_font_with_to_unicode and …_without_to_unicode_yields_no_unicode (inline mini-PDFs).
  • AGENTS.md: status, module layout, roadmap (1.3 split into part A done / part B next), tests, known gaps.
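The composite-font detection described in the list above could look roughly like the following. This is a sketch under assumptions: `FontDict` and `CidSystemInfo` are hypothetical stand-ins for the parser's real PDF object model, `simple_encoding` is an invented field, and the real parse_font signature is not shown in this PR text. Only the Font field names `composite`, `cid_registry` and `cid_ordering` come from the description.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified view of a font dictionary.
struct CidSystemInfo {
  std::string registry;
  std::string ordering;
};

struct FontDict {
  std::string subtype;   // value of /Subtype, without the slash
  std::string encoding;  // value of /Encoding when it is a name
  std::vector<FontDict> descendant_fonts;         // /DescendantFonts
  std::optional<CidSystemInfo> cid_system_info;   // /CIDSystemInfo
};

struct Font {
  bool composite = false;
  std::string cid_registry;
  std::string cid_ordering;
  // Assumed field: encoding name handed to the simple-font path only.
  std::optional<std::string> simple_encoding;
};

Font parse_font(const FontDict &dict) {
  Font font;
  if (dict.subtype == "Type0") {
    font.composite = true;
    if (!dict.descendant_fonts.empty()) {
      // Only the first descendant is consulted, as in the description.
      const FontDict &cid_font = dict.descendant_fonts.front();
      if (cid_font.cid_system_info) {
        font.cid_registry = cid_font.cid_system_info->registry;
        font.cid_ordering = cid_font.cid_system_info->ordering;
      }
    }
    // The Type0 /Encoding is a code -> CID CMap, so it is deliberately
    // not forwarded to the simple-font encoding path.
    return font;
  }
  font.simple_encoding = dict.encoding;
  return font;
}
```

Returning early for Type0 is what keeps Identity-H away from the simple-font encoding parser in this sketch.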

Notes

  • No reference-output changes: all Type0 fixtures already extract via /ToUnicode, so behavior is unchanged for the corpus; the new path only affects composite fonts that lack a /ToUnicode.
  • All 70 PDF unit tests pass locally (incl. the real-fixture end-to-end tests).
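To illustrate which fonts the new path touches, here is a small sketch of the lookup order, assuming a hypothetical `to_unicode_cmap` member and a hypothetical `to_unicode` signature returning `std::optional`. The real method may differ. The simple-font fallback is reduced to "the byte is the code point", and returning no text for a code missing from the CMap is also an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct Font {
  bool composite = false;
  std::string cid_registry;
  std::string cid_ordering;
  // Assumed representation of a parsed /ToUnicode CMap (absent if none).
  std::optional<std::map<std::uint32_t, std::u32string>> to_unicode_cmap;

  // Hypothetical signature: an empty optional means "no Unicode".
  std::optional<std::u32string> to_unicode(std::uint32_t code) const {
    if (to_unicode_cmap) {
      const auto it = to_unicode_cmap->find(code);
      if (it == to_unicode_cmap->end()) {
        return std::nullopt;
      }
      return it->second;
    }
    if (composite) {
      // Multi-byte codes must not go through the single-byte fallback.
      return std::nullopt;
    }
    // Simple font without /ToUnicode: single-byte identity fallback.
    return std::u32string(1, static_cast<char32_t>(code & 0xFF));
  }
};
```

Fonts that carry a /ToUnicode take the first branch whether or not they are composite, which matches the note that corpus output is unchanged.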

Part B (follow-up) needs

  • Go-ahead to vendor Adobe's cmap-resources + CID → Unicode tables as generated C++ (like the AGL in 1.2).
  • At least one CJK PDF fixture — the corpus has none.

@andiwand andiwand marked this pull request as ready for review June 14, 2026 19:51
Base automatically changed from pdf-encoding-to-unicode to main June 14, 2026 19:59
Recognize composite (Type0) fonts and drive their extraction through the
existing multi-byte /ToUnicode path (stage 1.1), which covers the whole
local corpus (every Type0 font is Identity-H + /ToUnicode).

- parse_font detects /Subtype /Type0, walks /DescendantFonts[0] and records
  the descendant CIDFont's /CIDSystemInfo /Registry//Ordering on Font, and
  keeps the Type0 /Encoding (a code -> CID CMap) out of the simple-font
  parse_encoding path — so Identity-H no longer trips the "unsupported
  /Encoding name" warning.
- Font gains composite/cid_registry/cid_ordering; Font::to_unicode returns
  "no Unicode" for a composite font lacking a /ToUnicode rather than
  mis-splitting its multi-byte codes through the single-byte identity
  fallback.
- Tests: composite_font_with_to_unicode and
  composite_font_without_to_unicode_yields_no_unicode.

Predefined CJK CMaps and the CID -> Unicode tables (part B) are deferred:
they are the heavy data chunk and the corpus has no CJK fixture to validate
against.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand force-pushed the pdf-composite-cid-fonts branch from 27c674a to 8320665 Compare June 14, 2026 20:28
Comment thread test/src/internal/pdf/pdf_document_parser.cpp Outdated
Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated
- parse_composite_font takes Font& instead of Font*
- the composite_font_mini_pdf test helper uses a /// doc comment

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand force-pushed the pdf-composite-cid-fonts branch from ef8a1f6 to 5fefb7f Compare June 14, 2026 20:44
@andiwand andiwand enabled auto-merge (squash) June 14, 2026 20:46
@andiwand andiwand disabled auto-merge June 14, 2026 21:53
@andiwand andiwand merged commit f0f19af into main Jun 14, 2026
10 of 11 checks passed
@andiwand andiwand deleted the pdf-composite-cid-fonts branch June 14, 2026 21:53